Abstract

Spatial outliers represent locations which are significantly different from their
neighborhoods even though they may not be significantly different from the entire
population. Identification of spatial outliers can lead to the discovery of unexpected,
interesting, and implicit knowledge, such as local instability. In this paper,
we first provide a general definition of S-outliers for spatial outliers. This definition
subsumes the traditional definitions of spatial outliers. Second, we characterize
the computation structure of spatial outlier detection methods and present
scalable algorithms. Third, we provide a cost model of the proposed algorithms.
Finally, we provide experimental evaluations of our algorithms using a Minneapolis-St. Paul
(Twin Cities) traffic data set.

Keywords:  Outlier  Detection,  Spatial  Data  Mining,  Scalable  Algorithm  for  Outlier 
Detection 
1 Introduction

Global  outliers  have  been  informally  defined  as  observations  in  a  data  set  which  appear 
to  be  inconsistent  with  the  remainder  of  that  set  of  data  [2],  or  which  deviate  so  much 
from  other  observations  so  as  to  arouse  suspicions  that  they  were  generated  by  a  different 
mechanism [9]. The identification of global outliers can lead to the discovery of unexpected
knowledge and has a number of practical applications in areas such as credit card
fraud,  athlete  performance  analysis,  voting  irregularity,  and  severe  weather  prediction. 
This  paper  focuses  on  spatial  outliers,  i.e.,  observations  which  appear  to  be  inconsistent 
with their neighborhoods. Detecting spatial outliers is useful in many applications of geographic
information systems and spatial databases [23, 24]. These application domains
include  transportation,  ecology,  public  safety,  public  health,  climatology,  and  location 
based  services. 

We  model  a  spatial  data  set  to  be  a  collection  of  spatially  referenced  objects,  such  as 
houses, roads, and traffic sensors. Spatial objects have two distinct categories of dimensions
[27] along which attributes may be measured. Categories of dimensions* of interest
are spatial and non-spatial. Spatial attributes of a spatially referenced object include
location, shape, and other geometric or topological properties. Non-spatial attributes of
a spatially referenced object include traffic-sensor-identifiers, manufacturer, owner, age,
and measurement readings. A spatial neighborhood [27] of a spatially referenced object
is a subset of the spatial data based on a spatial dimension, e.g., location. Spatial
neighborhoods  may  be  defined  based  on  spatial  attributes,  e.g.,  location,  using  spatial 
relationships  such  as  distance  or  adjacency.  Comparisons  between  spatially  referenced 
objects  are  based  on  non-spatial  attributes. 

A  spatial  outlier  is  a  spatially  referenced  object  whose  non-spatial  attribute  values 
are  significantly  different  from  those  of  other  spatially  referenced  objects  in  its  spatial 
neighborhood.  Informally,  a  spatial  outlier  is  a  local  instability  (in  values  of  non-spatial 
attributes) or a spatially referenced object whose non-spatial attributes are extreme relative
to its neighbors, even though they may not be significantly different from the entire
population.  For  example,  a  new  house  in  an  old  neighborhood  of  a  growing  metropolitan 
area  is  a  spatial  outlier  based  on  the  non-spatial  attribute  house  age. 

In this paper, we provide a general definition of spatial outliers and propose efficient
spatial outlier detection algorithms. We provide cost models for outlier detection
algorithms, and compare alternative underlying data clustering methods. We also experimentally
evaluate the proposed algorithm using a Minneapolis-St. Paul (Twin Cities)
traffic  data  set. 


*Examination  of  other  categories  of  dimensions,  e.g.,  temporal,  is  beyond  the  scope  of  this  paper  and 
may  be  explored  in  future  work. 
1.1 An Illustrative Application Domain

The Traffic Management Center [18] Freeway Operations group archives traffic measurements
from the freeway system in Minneapolis-St. Paul (Twin Cities). The sensor
network includes about nine hundred stations, each of which contains one to four loop
detectors, depending on the number of lanes. Sensors embedded in the freeways monitor
the volume of traffic on the road. At regular intervals, this information is sent to
the  Traffic  Management  Center  for  operational  purposes,  e.g.,  ramp  meter  control,  and 
research  on  traffic  modeling  and  experiments. 

In this application, each station is a spatially referenced object with spatial attributes
(e.g., location) and non-spatial attributes (e.g., measurements). The spatial arrangement
of stations can be modeled as a spatial graph [25]. A directed edge from station
$S_1$ to station $S_2$ indicates the existence of a road segment allowing traffic to move from
$S_1$ to $S_2$. This graph is called a spatial graph because nodes, i.e., stations, are located
in a Euclidean space [27] where each node has a location specified by coordinates, e.g.,
<highway, mile point>. The non-spatial attributes include sensor-id and traffic measurements
(e.g., volume, occupancy). We are interested in discovering the location of stations
whose measurements are inconsistent with those of their neighbors. This spatial outlier
detection task is formalized as follows.

Let the traffic sensors constitute a collection of spatially referenced objects. The
location of a sensor represents a spatial attribute and is represented by the symbol $x$.
A traffic measurement (e.g., volume) constitutes a non-spatial attribute space and is
represented as $f(x)$. The neighborhood of $x$, $N(x)$, is the set of traffic sensors adjacent
to the sensor located at $x$. We note that the neighborhood relationship is based on
directed edges in the underlying spatial graph. Thus sensors on opposite sides (e.g.,
I-35W north bound and I-35W south bound) are not neighbors even if the pairwise
Euclidean distance is small. A sensor is compared to its neighborhood using the function
$S(x) = [f(x) - E_{y \in N(x)}(f(y))]$, where $f(x)$ is the attribute value for a location $x$, $N(x)$
is the set of neighbors of $x$, and $E_{y \in N(x)}(f(y))$ is the average attribute value for the
neighbors of $x$. The statistic function $S(x)$ denotes the difference of the attribute value
of a sensor located at $x$ and the average attribute value of $x$'s neighbors.
Figure 1: Example of a Spatial Statistic

Example: We illustrate the computation of the spatial statistic $S(x)$ using an
example, as shown in Figure 1. Consider location $x_3$ in which the attribute value
$f(x_3) = 16$, $x_3$'s neighborhood set $N(x_3) = \{x_2, x_4\}$, the average neighborhood attribute
value $E_{y \in N(x_3)}(f(y)) = \frac{f(x_2) + f(x_4)}{2} = 22$, and the spatial statistic function
$S(x_3) = [f(x_3) - E_{y \in N(x_3)}(f(y))] = 16 - 22 = -6$.
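
The computation in this example can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: the dictionary layout, the name `spatial_statistic`, and the individual neighbor values 20 and 24 are assumptions, chosen only so that the neighbor average equals 22 as stated in the text.

```python
from statistics import mean

def spatial_statistic(f, neighbors, x):
    """S(x) = f(x) minus the average of f over the neighbors N(x)."""
    return f[x] - mean(f[y] for y in neighbors[x])

# Only f(x3) = 16 and the neighbor average 22 come from the text;
# the separate values for x2 and x4 are hypothetical.
f = {"x2": 20, "x3": 16, "x4": 24}
neighbors = {"x3": ["x2", "x4"]}

s_x3 = spatial_statistic(f, neighbors, "x3")
print(s_x3)  # -6
```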

Theorem 1 Spatial Statistic $S(x)$ is normally distributed if the attribute value $f(x)$ is
normally distributed.

Proof:  The  formal  proof  is  available  in  Appendix  A. 

A popular test for detecting spatial outliers for normally distributed $f(x)$ can be
described as follows: Spatial statistic $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$. For each location $x$ with an
attribute value $f(x)$, $S(x)$ is the difference between the attribute value at location $x$
and the average attribute value of $x$'s neighbors, $\mu_s$ is the mean value of $S(x)$, and $\sigma_s$
is the value of the standard deviation of $S(x)$ over all stations. The choice of $\theta$ depends
on a specified confidence level. For example, a confidence level of 95 percent will lead to
$\theta \approx 2$.
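
A small sketch of this test is given below, under stated assumptions: the station values form a hypothetical chain of 12 stations, neighbors are taken as the adjacent stations in the chain, and the population standard deviation is used for $\sigma_s$ (the text does not say which estimator is intended). The function name is illustrative only.

```python
from statistics import mean, pstdev

def spatial_statistic_outliers(f, neighbors, theta=2.0):
    """Return locations x with |(S(x) - mu_s) / sigma_s| > theta."""
    # S(x) = f(x) - average of f over the neighbors of x
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    values = list(s.values())
    mu_s, sigma_s = mean(values), pstdev(values)
    if sigma_s == 0:
        return []  # no variation in S(x), nothing can be flagged
    return [x for x in s if abs((s[x] - mu_s) / sigma_s) > theta]

# Hypothetical data: every station reports 10 except station 6.
f = {i: 10 for i in range(12)}
f[6] = 50
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}

outliers = spatial_statistic_outliers(f, neighbors)
print(outliers)  # [6]
```

In this toy data set the two stations adjacent to station 6 also receive a noticeably negative $S(x)$, but their standardized values stay below the threshold.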

The assumption of a normal distribution for the $f(x)$ and $S(x)$ functions can be
tested in our traffic data set. In this data set, the volume values of all stations at
each time slot are approximately normally distributed. A histogram of the numbers of
stations for different intervals of a non-spatial attribute, volume, is shown in Figure 2(a),
where a normal probability distribution curve is superimposed on the histogram. The
normal distribution seems to approximate the volume distribution reasonably well. We
calculated the intervals $[\mu - \sigma, \mu + \sigma]$, $[\mu - 2\sigma, \mu + 2\sigma]$, and $[\mu - 3\sigma, \mu + 3\sigma]$, where $\mu$ and
$\sigma$ are the mean and standard deviation of the volume distribution, and the percentages
of measurements falling in the three intervals are equal to 68.27%, 95.45%, and 99.73%,
respectively. These values are quite close to the corresponding values (68%, 95%, and
100%) for a normal distribution [4]. Moreover, we plot the normal probability plot in
Figure 2(b), and it appears linear. Hence the values of the non-spatial attribute volume
for all stations at the same time are approximately normally distributed. The difference
function, $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$, which computes the difference of volume and the
average volume of corresponding neighbors, also seems normally distributed, as shown
in Figure 2(c). Given the confidence level $100(1-\alpha)\%$, we can calculate the confidence
interval for the difference distribution, i.e., the difference values lie between
$-Z_{\alpha/2}$ and $Z_{\alpha/2}$ standard deviations of the mean. Those stations with a spatial statistic
$Z_{S(x)}$ value (standardized $S(x)$) greater than $Z_{\alpha/2}$ or less than $-Z_{\alpha/2}$ are classified as spatial
outliers.
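
The two quantities used in this check can be sketched with the Python standard library. The helper names `z_threshold` and `fraction_within` are hypothetical, and the sample values are made up for illustration; the traffic data itself is not reproduced here.

```python
from statistics import NormalDist, mean, pstdev

def z_threshold(confidence):
    """Two-sided critical value Z_{alpha/2} for a 100(1 - alpha)% level."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def fraction_within(values, k):
    """Fraction of values inside [mu - k*sigma, mu + k*sigma]."""
    mu, sigma = mean(values), pstdev(values)
    return sum(1 for v in values if abs(v - mu) <= k * sigma) / len(values)

print(round(z_threshold(0.95), 2))  # 1.96

# Hypothetical measurements, only to show the call.
sample = [1, 2, 3, 4, 5]
print(fraction_within(sample, 1))  # 0.6
```

The first value is the reason a 95 percent confidence level corresponds to a threshold of roughly 2.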

1.2  Definition  of  S-Outliers 

Consider a spatial framework $SF = \langle S, NB \rangle$, where $S$ is a set of locations
$\{s_1, s_2, \ldots, s_n\}$ and $NB : S \times S \to \{True, False\}$ is a neighbor relation over $S$. We
define a neighborhood $N(x)$ of a location $x$ in $S$ using $NB$, specifically
$N(x) = \{y \mid y \in S, NB(x, y) = True\}$.
Figure 2: Verification of normal distribution for traffic volumes and volume difference
over neighbors (surviving panel label: (c) Histogram of volume)


Definition: An object O is an S-outlier($f$, $f_{aggr}^{N}$, $F_{diff}$, $ST$) if
$ST\{F_{diff}[f(x), f_{aggr}^{N}(f(x), N(x))]\}$ is true, where $f : S \to R$ is an attribute function, $f_{aggr}^{N} : R^{N} \to R$ is
an aggregation function for the values of $f$ over neighborhood, $R$ is a set of real numbers,
$F_{diff} : R \times R \to R$ is a difference function, and $ST : R \to \{True, False\}$ is a statistic
test procedure for determining statistical significance.
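
The definition composes three pluggable functions, which can be sketched as a single higher-order predicate. This is only an illustration under assumptions: the function `is_s_outlier`, the dictionary representation of $f$ and $N(x)$, the sample values, and the fixed cutoff of 5 in the test function are all hypothetical.

```python
from statistics import mean

def is_s_outlier(x, f, neighbors, f_aggr, f_diff, st):
    """Evaluate ST(F_diff(f(x), f_aggr over N(x))) for one location x."""
    aggregated = f_aggr([f[y] for y in neighbors[x]])
    return st(f_diff(f[x], aggregated))

# Hypothetical instantiation: average as aggregate, subtraction as
# difference, and a fixed cutoff standing in for a statistical test.
f = {"x2": 20, "x3": 16, "x4": 24}
neighbors = {"x3": ["x2", "x4"]}

flag = is_s_outlier(
    "x3", f, neighbors,
    f_aggr=mean,
    f_diff=lambda value, aggregate: value - aggregate,
    st=lambda d: abs(d) > 5,
)
print(flag)  # True
```

Swapping in a different aggregate, difference, or test function changes which outlier definition is being evaluated without touching the surrounding code.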

Example 1. The spatial outliers defined in Section 1.1 are examples of S-outliers.
We can define respective components in the traffic application domain as follows. The $f$
is the non-spatial attribute, namely, traffic volume. The neighborhood aggregate function
$f_{aggr}^{N}(x) = E_{y \in N(x)}(f(y))$ is the average attribute value function over neighborhood $N(x)$.
The difference function $F_{diff}(x)$ is $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$, i.e., the arithmetic
difference between attribute function $f(x)$ and neighborhood aggregate function $f_{aggr}^{N}(x)$.
Let $\mu_s$ and $\sigma_s$ be the mean and standard deviation of the difference function $F_{diff}$;
then the significance test function $ST$ can be defined as $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$.

Example 2. A DB(p, D)-outlier [15] is also an example of an S-outlier. For a $k$
dimensional data set $T$ with $N$ objects, an object $O$ in $T$ is a DB(p, D)-outlier if at least
a fraction $p$ of the objects in $T$ lies greater than distance $D$ from $O$ [15]. Assuming $f_{aggr}^{N}(x)$
is the number of objects within distance $D$ from object $O$, the statistical test function
$ST$ can be defined as $\frac{(\text{Total number of objects}) - f_{aggr}^{N}(x)}{\text{Total number of objects}} \ge p$. The DB(p, D)-outlier subsumes many
other definitions of global outliers [15].
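
The DB(p, D) test in Example 2 can be sketched as follows. The points are hypothetical, the function name is illustrative, and counting the object itself among those within distance $D$ (it is at distance 0) is an assumption of this sketch.

```python
import math

def is_db_outlier(index, points, p, d):
    """True if at least a fraction p of the points lie farther than d
    from points[index]."""
    total = len(points)
    # Aggregate: number of objects within distance d (includes the point itself).
    within = sum(1 for q in points if math.dist(points[index], q) <= d)
    return (total - within) / total >= p

# Hypothetical 2-D data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

print(is_db_outlier(4, points, p=0.7, d=3))  # True
print(is_db_outlier(0, points, p=0.7, d=3))  # False
```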

1.3  Contribution,  Outline,  and  Scope 

This  paper  provides  a  general  definition  of  spatial  outliers  and  shows  that  various  tests 
for detecting spatial outliers are special cases. We identify the basic spatial-self-join computational
structure for the scalable implementation of spatial outlier tests and recognize
clustering  methods  to  be  the  primary  design  decision  influencing  the  total  computational 
cost.  We  also  provide  efficient  strategies  to  implement  a  spatial  outlier  detection  test 
and  evaluate  our  method  in  a  Twin-Cities  traffic  data  set  to  show  its  effectiveness  and 
usefulness. 

The rest of the paper is organized as follows. Section 2 reviews related work and discusses
our contributions. In Section 3, we discuss the computation structure for detecting
spatial  outliers  and  propose  our  general  outlier  detection  algorithms.  The  cost  models 
for  proposed  algorithms  are  analyzed  in  Section  4.  Section  5  presents  our  experiment 
design.  The  experimental  observation  and  results  are  shown  in  Section  6.  We  summarize 
our  work  and  describe  the  future  direction  of  our  research  in  Section  7. 

This paper focuses on spatial outlier detection using a single attribute. Outlier detection
in multi-dimensional space using multiple attributes is beyond the scope of this
paper. 


2  Related  Work 

Many  outlier  detection  algorithms  [1,  2,  3,  13,  14,  20,  22,  28]  have  been  recently  proposed. 
As  shown  in  Figure  3(a),  these  methods  can  be  broadly  classified  into  two  categories, 
namely  one-dimensional(linear)  outlier  detection  methods  and  multi-dimensional  outlier 
detection  methods.  The  one-dimensional  outlier  detection  algorithms  [2,  10]  consider  the 
statistical  distribution  of  non-spatial  attribute  values,  ignoring  the  spatial  relationships 
between  items.  Numerous  outlier  detection  tests,  known  as  discordancy  tests  [2,  10], 
have  been  developed  for  different  circumstances,  depending  on  the  data  distribution,  the 
number  of  expected  outliers,  and  the  types  of  expected  outliers.  The  main  idea  is  to  fit 
the  data  set  to  a  known  standard  distribution,  and  develop  a  test  based  on  distribution 
properties. We use an example to illustrate the differences between one-dimensional and
multidimensional outlier detection methods. In Figure 4(a), the X-axis is the location
of data points in one dimensional space; the Y-axis is the attribute value for each data
point. One-dimensional outlier detection methods ignore the spatial location of each data
point, and fit the distribution model to the values of the non-spatial attribute. The outlier
detected using a one-dimensional approach is the data point G, which has an extremely
high attribute value 7.9, exceeding the threshold of $\mu + 2\sigma = 4.49 + 2 \times 1.61 = 7.71$, as
shown in Figure 4(b). This test assumes a normal distribution for attribute values.
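
A one-dimensional test of this kind can be sketched as below. The data values are hypothetical (the data of Figure 4 are not reproduced here), the two-sided form of the test and the population standard deviation are assumptions, and only the threshold arithmetic on the last lines uses numbers from the text.

```python
from statistics import mean, pstdev

def global_outliers(values, theta=2.0):
    """Flag keys whose value lies more than theta standard deviations
    from the mean, ignoring any spatial location."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    return [k for k, v in values.items() if abs(v - mu) > theta * sigma]

# Hypothetical attribute values: nine points at 5 and one point G at 15.
values = {name: 5 for name in "ABCDEFHIJ"}
values["G"] = 15
print(global_outliers(values))  # ['G']

# Threshold quoted in the text: mu + 2 * sigma with mu = 4.49, sigma = 1.61.
paper_threshold = round(4.49 + 2 * 1.61, 2)
print(paper_threshold, 7.9 > paper_threshold)  # 7.71 True
```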

Multi-dimensional outlier detection methods can be further grouped into two categories,
namely homogeneous multidimensional metric based methods and spatial methods.
The homogeneous multidimensional methods model data sets as a collection of
points in a multidimensional isometric space, and provide tests based on concepts such
as distance, density, and convex-hull depth. These methods do not distinguish between
attribute dimensions and geo-spatial dimensions, and use all dimensions for defining
neighborhood as well as for comparison, as shown in Figure 3(b). We now discuss representative
methods. Knorr and Ng presented the notion of distance-based outliers [13, 14].
Figure 3: Classification and comparison of outlier detection methods


As discussed in Example 2 of Section 1, for a $k$ dimensional data set $T$ with $N$ objects,
an object $O$ in $T$ is a DB(p, D)-outlier if at least a fraction $p$ of the objects in $T$ lies
greater than distance $D$ from $O$. Ramaswamy et al. [21] proposed a formulation for
distance-based outliers based on the distance of a point from its $k^{th}$ nearest neighbor.
After ranking points by the distance to their $k^{th}$ nearest neighbor, the top $n$ points are
declared as outliers. Breunig et al. [3] introduced the notion of a "local" outlier in which
the outlier-degree of an object is determined by taking into account the clustering structure
in a bounded neighborhood of the object, e.g., $k$ nearest neighbors. They formally
defined the outlier factor to capture this relative degree of isolation or outlierness. In
computational geometry, depth-based approaches [22, 20] organize data objects in convex
hull layers in data space according to peeling depth [20], and outliers are expected to be
found from data objects with shallow depth values. Yu et al. [28] introduced an outlier
detection approach, called FindOut, which identifies outliers by removing clusters from
the original data. The key idea of this approach is to apply signal processing techniques
to transform the space and find the dense regions in the transformed space. The remaining
objects in the non-dense regions are labeled as outliers. In Figure 4(a), for example,
the outliers detected using homogeneous multidimensional approaches are the data points
D and T, which lie in a low density area.
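
As an illustration of the ranking idea attributed to Ramaswamy et al. above, the following brute-force sketch ranks points by the distance to their $k$-th nearest neighbor and returns the top $n$. It is not their algorithm, which is designed for large data sets; the function names and sample points are hypothetical.

```python
import math

def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other point."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_outliers(points, k, n):
    """Rank points by k-th nearest neighbor distance, largest first."""
    ranked = sorted(range(len(points)),
                    key=lambda i: kth_nn_distance(points, i, k),
                    reverse=True)
    return [points[i] for i in ranked[:n]]

# Hypothetical data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(points, k=2, n=1))  # [(10, 10)]
```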

Homogeneous  multidimensional  methods  have  several  limitations.  First,  they  are 
designed  to  detect  global  outliers  rather  than  spatial  outliers.  Second,  they  assume 
that the data items are embedded in an isometric metric space and do not distinguish
between non-spatial attributes and spatial attributes. Third, they do not exploit a priori
information  about  the  statistical  distribution  of  attribute  data.  Last,  they  seldom  provide 
a  confidence  measure  for  the  discovered  outliers. 
Figure 4: A Data Set for Outlier Detection. (a) An Example Data Set. (b) Histogram.


Bi-partite multidimensional tests are designed to detect spatial outliers. They separate
spatial attributes from non-spatial attributes, as shown in Figure 3(b). Spatial attributes
are used to characterize location, neighborhood, and distance. Non-spatial attribute
dimensions are used to compare a spatially referenced object to its neighbors. Spatial
statistics literature provides two kinds of bi-partite multidimensional tests, namely
graphical tests and quantitative tests. Graphical tests are based on visualization of spatial
data which highlight spatial outliers. Example methods include variogram clouds
[5] and Moran scatterplots [17]. Quantitative methods provide a precise test to distinguish
spatial outliers from the remainder of data. Scatterplots [16] are a representative
technique from the quantitative family.

A  variogram-cloud  displays  data  points  related  by  neighborhood  relationships.  For 
each pair of locations, the square-root of the absolute difference between attribute values
at the locations is plotted against the Euclidean distance between the locations. In
data  sets  exhibiting  strong  spatial  dependence,  the  variance  in  the  attribute  differences 
will  increase  with  increasing  distance  between  locations.  Locations  that  are  near  to  one 
another,  but  with  large  attribute  differences,  might  indicate  a  spatial  outlier,  even  though 
the  values  at  both  locations  may  appear  to  be  reasonable  when  examining  the  data  set 
non-spatially.  Figure  5(a)  shows  a  variogram  cloud  for  the  example  data  set  shown  in 
Figure 4(a). This plot shows that two pairs $(P, S)$ and $(Q, S)$ in the left hand side lie
above the main group of pairs, and are possibly related to spatial outliers. The point
$S$ may be identified as a spatial outlier since it occurs in both pairs $(Q, S)$ and $(P, S)$.
However,  graphical  tests  of  spatial  outlier  detection  are  limited  by  the  lack  of  precise 
criteria  to  distinguish  spatial  outliers.  In  addition,  a  variogram  cloud  requires  non-trivial 
post-processing  of  highlighted  pairs  to  separate  spatial  outliers  from  their  neighbors, 
particularly  when  multiple  outliers  are  present  or  density  varies  greatly. 
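
The points of a variogram cloud can be computed with a few lines of Python. This is a sketch under assumptions: it enumerates all pairs of locations, the names P, Q, S echo the example but their coordinates and attribute values are hypothetical, and plotting is left out.

```python
import math
from itertools import combinations

def variogram_cloud(locations, values):
    """For every pair of locations, return
    (a, b, Euclidean distance, sqrt(|f(a) - f(b)|))."""
    return [
        (a, b,
         math.dist(locations[a], locations[b]),
         math.sqrt(abs(values[a] - values[b])))
        for a, b in combinations(sorted(locations), 2)
    ]

# Hypothetical one-dimensional locations and attribute values.
locations = {"P": (1.0,), "S": (2.0,), "Q": (3.0,)}
values = {"P": 2, "S": 11, "Q": 2}

cloud = variogram_cloud(locations, values)
for row in cloud:
    print(row)
```

Pairs with a small distance but a large attribute difference, here the two pairs containing S, are the ones that would stand out at the upper left of the plot.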

Figure 5: Variogram Cloud and Moran Scatterplot to Detect Spatial Outliers

A Moran scatterplot [17] is a plot of normalized attribute value ($Z[f(i)] = \frac{f(i) - \mu_f}{\sigma_f}$)
against the neighborhood average of normalized attribute values ($W \cdot Z$), where $W$ is the
row-normalized (i.e., $\sum_j W_{ij} = 1$) neighborhood matrix (i.e., $W_{ij} > 0$ iff neighbor($i$, $j$)).

The  upper  left  and  lower  right  quadrants  of  Figure  5(b)  indicate  a  spatial  association 
of dissimilar values: low values surrounded by high value neighbors (e.g., points P and
Q), and high values surrounded by low values (e.g., point S). Thus we can identify
points (nodes) that are surrounded by unusually high or low value neighbors. These
points  can  be  treated  as  spatial  outliers. 

Definition: A $Moran_{outlier}$ is a point located in the upper left or lower right quadrant of
a Moran scatterplot. This point can be identified by $(Z[f(i)]) \times (\sum_j (W_{ij} Z[f(j)])) < 0$.

Lemma 1 A $Moran_{outlier}$ is a special case of an S-outlier.

Proof: For a $Moran_{outlier}$, the difference functions are $Z_i$ and $I_i$, where $Z_i = \frac{f(i) - \mu_f}{\sigma_f}$,
$I_i = \sum_j (W_{ij} Z[f(j)])$, and $\mu_f$ and $\sigma_f$ are the mean and standard deviation of the attribute
function $f(i)$. The statistic test function $ST$ is $(Z[f(i)]) \times (\sum_j (W_{ij} Z[f(j)])) < 0$.
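
The quadrant test can be sketched as follows, assuming a row-normalized $W$ in which every neighbor of $i$ carries equal weight $1/|N(i)|$, non-constant attribute values, and the population standard deviation. The chain data and the function name are hypothetical.

```python
from statistics import mean, pstdev

def moran_outliers(f, neighbors):
    """Flag nodes i with Z[f(i)] * sum_j(W_ij * Z[f(j)]) < 0."""
    data = list(f.values())
    mu, sigma = mean(data), pstdev(data)
    z = {i: (v - mu) / sigma for i, v in f.items()}
    # Equal weights 1/|N(i)| make the weighted sum a plain neighbor average.
    wz = {i: mean(z[j] for j in neighbors[i]) for i in f}
    return [i for i in f if z[i] * wz[i] < 0]

# Hypothetical chain of 7 nodes with a rising trend and one jump at node 3.
f = {0: 1, 1: 2, 2: 3, 3: 10, 4: 5, 5: 6, 6: 7}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in f}

flagged = moran_outliers(f, neighbors)
print(flagged)  # [2, 3]
```

Node 2 is flagged as well as node 3: it is a below-average value whose neighbor average is pulled above the mean by node 3, which is the "low value surrounded by high value neighbors" case.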

A scatterplot [8, 16] shows attribute values on the X-axis and the average of the
attribute values in the neighborhood on the Y-axis. A least square regression line is used
to identify spatial outliers. A scatter sloping upward to the right indicates a positive
spatial autocorrelation (adjacent values tend to be similar); a scatter sloping upward
to the left indicates a negative spatial autocorrelation. The residual is defined as the
vertical distance (Y-axis) between a point $P$ with location $(X_p, Y_p)$ and the regression line
$Y = mX + b$, that is, residual $\epsilon = Y_p - (mX_p + b)$. Cases with standardized residuals,
$\epsilon_{standard} = \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}$, greater than 3.0 or less than -3.0 are flagged as possible spatial outliers,
where $\mu_\epsilon$ and $\sigma_\epsilon$ are the mean and standard deviation of the distribution of the error term
$\epsilon$. In Figure 6(a), a scatter plot shows the attribute values plotted against the average
of the attribute values in neighboring areas for the data set in Figure 4(a). The point S
turns out to be the farthest from the regression line and may be identified as a spatial
outlier.

Definition: A $Scatterplot_{outlier}$ is a point with significant standardized residual error
from the least square regression line in a scatter plot. Assuming errors are normally
distributed, then $\epsilon_{standard} = \left| \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} \right| > \theta$ is a common test. Nodes with standardized
residuals $\epsilon_{standard} = \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon}$ from regression line $Y = mX + b$ that are greater than $\theta$ or less than
$-\theta$ are flagged as possible spatial outliers. Here $\mu_\epsilon$ and $\sigma_\epsilon$ are the mean and standard
deviation of the distribution of the error term $\epsilon$.

Lemma 2 A $Scatterplot_{outlier}$ is a special case of an S-outlier.

Proof: For a $Scatterplot_{outlier}$, the neighborhood aggregate function is
$f_{aggr}^{N} = E(x) = \frac{1}{|N(x)|} \sum_{y \in N(x)} f(y)$, the average attribute value of neighbors. The difference function is
$F_{diff} = \epsilon = E(x) - (m * f(x) + b)$, where $m$ and $b$ characterize the slope and intercept
of the least square line fitting $(f(x), E(x))$. The spatial outliers are tested using statistic
test function $ST = \left( \left| \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} \right| > \theta \right)$.
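
The scatterplot test can be sketched as below. The closed-form least squares fit is standard; the chain of 41 nodes, the spike at node 20, the use of the population standard deviation, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev

def scatterplot_outliers(f, neighbors, theta=3.0):
    """Regress E(x) on f(x) and flag nodes whose standardized
    residual exceeds theta in absolute value."""
    nodes = list(f)
    xs = [float(f[x]) for x in nodes]                          # X-axis: f(x)
    es = [mean(f[y] for y in neighbors[x]) for x in nodes]     # Y-axis: E(x)
    n = len(nodes)
    sx, se = sum(xs), sum(es)
    sxe = sum(a * b for a, b in zip(xs, es))
    sxx = sum(a * a for a in xs)
    m = (n * sxe - sx * se) / (n * sxx - sx * sx)              # slope
    b = (se - m * sx) / n                                      # intercept
    eps = [e - (m * x + b) for x, e in zip(xs, es)]            # residuals
    mu, sigma = mean(eps), pstdev(eps)
    return [v for v, r in zip(nodes, eps) if abs((r - mu) / sigma) > theta]

# Hypothetical chain: f(i) = i along the chain, except a spike at node 20.
f = {i: float(i) for i in range(41)}
f[20] = 40.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 40] for i in f}

print(scatterplot_outliers(f, neighbors, theta=3.0))  # [20]
```

With very few points a single extreme value can pull the regression line toward itself, so a sketch like this is only informative when the data set is reasonably large.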

Figure 6(b) shows the visualization of the spatial statistic method described earlier in
Section 1.1 and Example 1. The X-axis is the location of data points in one dimensional
space; the Y-axis is the value of spatial statistic $Z_{S(x)}$ for each data point. We can easily
observe that the point S has a $Z_{S(x)}$ value exceeding 3, and will be detected as a spatial
outlier. Note the two neighboring points P and Q of S have $Z_{S(x)}$ values close to -2 due
to the presence of a spatial outlier in their neighborhoods. Example 1 has already shown
that $Z_{S(x)}$ is a special case of an S-outlier.
3 Spatial Outlier Detection: Problem Definition and
Proposed Algorithms

In this section, we provide a formal definition of the problem of designing computationally
efficient techniques for detecting spatial outliers. Earlier sections presented a definition of
spatial outliers and showed that the definition subsumes other quantitative spatial outlier
definitions. Table 1 shows examples of difference function $F_{diff}$ and statistic test function
$ST$ for different quantitative spatial outlier detection methods. Difference function $F_{diff}$
computes parameters that are used by statistical test function $ST$ to verify the outlierness
of a node. We show $F_{diff}$ and $ST$ functions for the Spatial statistic $Z_{S(x)}$, Scatterplot, and
Moran scatterplot approaches to summarize the lemmas presented in the earlier section.
For example, in the scatterplot approach, the difference function computes the error term
$\epsilon$, which is the value of the vertical distance between a node and the regression line in the
X-Y plane and is defined as $F_{diff} : \epsilon = E(x) - (m * f(x) + b)$, where $E(x)$, the average
attribute value of neighbor nodes of $x$, is the Y-axis value; $f(x)$, the attribute value of
node $x$, is the X-axis value; and $m$ and $b$ are the slope and intercept of the scatterplot
line in the X-Y plane.

The  computation  needs  of  spatial  outlier  detection  are  divided  into  two  parts,  model 
building  and  test  result  computation.  Model  building  computes  aggregate  functions  used 
by the difference function $F_{diff}$ and statistic test function $ST$, as shown in the last row of
Table  1.  We  discuss  the  computation  of  the  aggregate  functions  and  propose  algorithms 
for  model  building  and  test  result  computation. 


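Test Computation

| Spatial Outlier Definition | Spatial statistic $Z_{S(x)}$ | Scatterplot | Moran scatterplot |
| Difference function $F_{diff}$ | $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$ | $\epsilon = E(x) - (m * f(x) + b)$ | $Z_i = \frac{f(i) - \mu_f}{\sigma_f}$, $I_i = \sum_j (W_{ij} Z[f(j)])$ |
| Statistic test function $ST$ | $\left\lvert \frac{S(x) - \mu_s}{\sigma_s} \right\rvert > \theta$ | $\left\lvert \frac{\epsilon - \mu_\epsilon}{\sigma_\epsilon} \right\rvert > \theta$ | $(Z[f(i)]) \times (\sum_j (W_{ij} Z[f(j)])) < 0$ |
| Aggregate function used in $F_{diff}$ and $ST$ | $\mu_s$, $\sigma_s$ | $m$, $b$, $\mu_\epsilon$, $\sigma_\epsilon$ | $\mu_f$, $\sigma_f$ |

Table 1: Examples of $F_{diff}$ and $ST$ functions for different approaches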


3.1  Problem  Definition 

Given the components of the S-outlier definition, the objective is to design a computationally
efficient algorithm to detect the S-outliers. The components of the S-outlier
definition are restricted via constraints to allow computational efficiency while preserving
the correctness of the Lemmas showing that various existing spatial outlier detection
tests (e.g., Scatterplot, Moran scatterplot, Spatial statistic $Z_{S(x)}$) are special cases of
S-outliers. Thus the algorithms proposed in this section are useful in building models to
detect spatial outliers via a variety of existing techniques. The following optimization
problem characterizes the problem of designing efficient algorithms for detecting spatial
outliers:
Spatial Outlier Detection Problem

Given:

• A spatial framework $S$ consisting of locations $s_1, s_2, \ldots, s_n$

• A neighborhood relationship $N \subseteq S \times S$

• An attribute function $f : s_i \to R$

• A neighborhood aggregate function $f_{aggr}^{N} : R^{N} \to R$, where $N$ is
the maximum neighbor number for a location

• A comparison function $F_{diff}(f, f_{aggr}^{N})$

• Statistic test function $ST : R \to \{True, False\}$

Design: An efficient algorithm to detect S-outliers,
i.e., $\{s_i \mid s_i \in S, s_i \text{ is an S-outlier}\}$

Objective:

• Efficiency: to minimize the computation time

Constraints:

• $F_{diff}$ and $ST$ are algebraic aggregate functions of
values of $f(x)$ and $f_{aggr}^{N}$

• The size of the data set is much greater than the main memory size

• Computation time is determined by I/O time

Aggregate functions can be grouped into three categories, namely, distributive,
algebraic, and holistic [6, 26]. An aggregate function F is called distributive if there
exists a function G such that the value of F for a data set can be computed by applying G
to the value of F in each partition of the data set. In most cases, F = G. Examples
of distributive aggregate functions include count, max, and sum, as shown in Appendix
B. An aggregate function F is algebraic if F of a data set can be computed using a fixed
number of distributive aggregates from each partition of the data set. Average, variance,
standard deviation, maxN, and minN are all algebraic aggregate functions. Illustrations are
available in Appendix B. An aggregate function F is called holistic if the value of F for
a data set cannot be computed using a constant number of distributive aggregates from
each partition of the data set. Median is an example of a holistic aggregate function.
We note that algebraic and distributive aggregate functions can be computed in a single
scan of a data set even when the data set is too large to fit in main memory. When
a data set is larger than main memory, extra disk scans are required to calculate
a holistic aggregate function.
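The distinction between distributive and algebraic aggregates can be illustrated with a minimal sketch. It assumes the data set arrives as a sequence of in-memory partitions; the function names (`partition_summary`, `merge`, `mean_and_std`) and the sample values are illustrative and do not come from the paper. Count, sum, and sum of squares are distributive, and the mean and standard deviation are then derived algebraically from those three values.

```python
from math import sqrt


def partition_summary(values):
    """Distributive aggregates of one partition: count, sum, sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Combine two partition summaries component by component."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_std(summary):
    """Algebraic aggregates derived from the three distributive ones."""
    n, s, ss = summary
    mean = s / n
    # Population variance; clamp at zero to absorb floating-point noise.
    variance = max(ss / n - mean * mean, 0.0)
    return mean, sqrt(variance)


# Hypothetical data split into three partitions (e.g., three disk pages).
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]

total = (0, 0.0, 0.0)
for part in partitions:  # single pass; only one partition is held at a time
    total = merge(total, partition_summary(part))

mean, std = mean_and_std(total)
```

A median, by contrast, cannot be obtained from a fixed-size summary per partition, which is why it is classified as holistic.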

For each node x, the attribute function $f(x)$ returns the attribute value of x.
The neighborhood aggregate function $f^N_{aggr}$ computes a value using the attribute value of
x and the attribute values of x's neighboring nodes. The distributive aggregate functions
compute the aggregate values (e.g., sum, count) of the attribute value and neighborhood
aggregate value over all nodes. The algebraic aggregate functions compute the statistic
values for all nodes, e.g., mean and standard deviation, and can be derived using
the values computed by the distributive aggregate functions. The comparison function
$F_{diff}$ and statistic test function $ST$ for the quantitative spatial outlier definition can be
computed using algebraic aggregate functions of values from $f(x)$ and $f^N_{aggr}$. Table 2
shows the algebraic aggregate functions for different quantitative definitions of spatial
outliers. Each column shows the computation structure of the attribute function,
neighborhood aggregate function, distributive aggregate functions, and algebraic aggregate
functions for each spatial outlier detection approach. For example, in the scatterplot
approach, the attribute function is $f(x)$; the neighborhood aggregate function is
$f^N_{aggr} = E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$; the distributive aggregate functions $D^{G_i}_{aggr}$ are
$\sum f(x)$, $\sum E(x)$, $\sum f(x)E(x)$, $\sum f^2(x)$, and $\sum E^2(x)$; and the algebraic aggregate
functions $A^{G_i}_{aggr}$ are the slope $m$ and the intercept $b$ of the regression line, and the
standard deviation of the error term $\epsilon$, all of which can be derived using the
distributive aggregate functions.

Using Table 2, we can compute the algebraic aggregate functions in a single
scan of the spatial self-join of the station data set using the neighbor relationship. For
example, the standard deviation of the error term $\epsilon$ in the scatterplot approach can
be computed from the values of the distributive aggregate functions. In a
naive approach, however, two data scans of the spatial self-join may be used, where the
first scan computes the slope and intercept of the regression line, and the second scan
calculates the statistic values (e.g., mean and standard deviation) of the error term.
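The single-scan computation for the scatterplot approach can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the standard least-squares identities, regresses $E(x)$ on $f(x)$, and divides the residual sum of squares by $n - 2$; the function name `scatterplot_model` and the sample pairs are hypothetical.

```python
from math import sqrt


def scatterplot_model(pairs):
    """Fit E(x) = m * f(x) + b from one pass over (f(x), E(x)) pairs.

    Only the distributive aggregates (count and five sums) are kept in
    memory; slope, intercept, and residual standard deviation are derived
    from them afterwards.
    """
    n = 0
    sum_f = sum_e = sum_fe = sum_ff = sum_ee = 0.0
    for f, e in pairs:  # the single scan
        n += 1
        sum_f += f
        sum_e += e
        sum_fe += f * e
        sum_ff += f * f
        sum_ee += e * e

    s_xx = sum_ff - sum_f * sum_f / n
    s_yy = sum_ee - sum_e * sum_e / n
    s_xy = sum_fe - sum_f * sum_e / n

    m = s_xy / s_xx
    b = (sum_e - m * sum_f) / n
    # Residual sum of squares equals S_yy - m * S_xy (same as S_yy - m^2 * S_xx).
    # Assumption: n - 2 degrees of freedom for the error term.
    rss = max(s_yy - m * s_xy, 0.0)
    sigma = sqrt(rss / (n - 2))
    return m, b, sigma


# Hypothetical (f(x), E(x)) pairs lying exactly on E = 2 f + 1.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
m, b, sigma = scatterplot_model(iter(data))  # an iterator can be read only once
```

Because the input is consumed as an iterator, the function cannot revisit the data, which mirrors the one-scan constraint discussed above.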


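Model building for the scatterplot outlier definition (the Moran scatterplot column of
the table is organized into the same four rows):

Attribute function $f$:                        $f(x)$
Neighborhood aggregate function $f^N_{aggr}$:  $E(x) = \frac{1}{k} \sum_{y \in N(x)} f(y)$
Distributive aggregate functions
$D^{G_1}_{aggr}, D^{G_2}_{aggr}, \ldots, D^{G_k}_{aggr}$:  $\sum f(x)$, $\sum E(x)$, $\sum f(x)E(x)$, $\sum f^2(x)$, $\sum E^2(x)$, $n$ (count)
Algebraic aggregate functions
$A^{G_1}_{aggr}, A^{G_2}_{aggr}, \ldots, A^{G_k}_{aggr}$:  slope $m = \frac{n \sum f(x)E(x) - \sum f(x) \sum E(x)}{n \sum f^2(x) - (\sum f(x))^2}$,
                                               intercept $b$, $\mu_\epsilon = 0$, and $\sigma_\epsilon$,
                                               expressed through $S_{xx}$ and $S_{yy}$

Table 2: Model building to compute the aggregate functions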


3.2  Our  Approach 

The  computational  task  in  the  spatial  outlier  detection  problem  can  be  divided  into  two 
subtasks:  a)  design  an  efficient  computation  method  to  compute  the  global  statistical 
parameters  using  a  spatial  join  and  b)  test  whether  spatial  locations  on  a  given  path 
are  outliers.  The  first  task  is  called  model  building;  the  second  task  is  called  test  result 
computation. 

3.2.1  Model  Building 

An I/O efficient model building algorithm computes the algebraic aggregate functions,
e.g., the mean and standard deviation, in a single scan of a spatial self-join from a spatial
data set using a neighbor relationship. The computed values from the algebraic aggregate
functions can be used by the difference function $F_{diff}$ and statistic test function $ST$ to
validate the outlierness of an incoming data set. Algorithm 1 shows the steps of the
Model Building algorithm. In the first step, the algorithm retrieves the neighbor nodes
for each data object, say x; then it computes the neighborhood aggregate function $f^N_{aggr}$.
The distributive aggregate functions are then aggregated using the attribute function
$f(x)$ and the neighborhood aggregate function $f^N_{aggr}$. Finally, the algebraic aggregate
functions are computed using the values from the distributive aggregate functions. Note
that the data objects are processed on a page basis to reduce redundant I/O. In other
words, all the nodes within the same disk page are processed before retrieving the nodes
of the next disk page.

Model Building Algorithm

Input:  S is a spatial framework;
        f is an attribute function;
        N is the neighborhood relationship;
        f_aggr^N is a neighborhood aggregate function;
        D_aggr^G1, D_aggr^G2, ..., D_aggr^Gk are the distributive aggregate functions;
Output: Algebraic aggregate functions A_aggr^G1, ..., A_aggr^Gk

for(i=1; i <= |S|; i++){
    Oi = Get_One_Object(i, S);                 /* Select each object from S */
    NNS = Find_Neighbor_Nodes_Set(Oi, N, S);   /* Find neighbor nodes of Oi from S */
    for(j=1; j <= |NNS|; j++){
        Oj = Get_One_Object(j, NNS);           /* Select each neighbor of Oi */
        f_aggr^N = Compute_and_Aggregate(f(Oi), f(Oj));
    }
    /* Add the element to global aggregate functions */
    Aggregate_Element(D_aggr^G1, D_aggr^G2, ..., D_aggr^Gk, f_aggr^N, i);
}
/* Compute the algebraic aggregate functions */
<A_aggr^G1, ..., A_aggr^Gk> = Compute_Algebraic_Aggregate(D_aggr^G1, ..., D_aggr^Gk);
return (A_aggr^G1, A_aggr^G2, ..., A_aggr^Gk).

Algorithm 1: Pseudo-code for model construction
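The model building loop can be sketched in a few lines, with several assumptions made explicit: the data fit in Python dictionaries (so no page handling is modeled), the neighborhood aggregate is the average attribute value of the neighbors, the comparison value is the difference $S(x) = f(x) - E(x)$, and the algebraic aggregates are the mean and standard deviation of $S(x)$. The names `build_model`, `f`, and `neighbors` are illustrative, not part of the paper.

```python
from math import sqrt


def build_model(f, neighbors):
    """One pass over all nodes; returns (mean, std) of S(x) = f(x) - E(x).

    f:         dict mapping node -> attribute value
    neighbors: dict mapping node -> list of neighbor nodes
    """
    n = 0
    sum_s = 0.0   # distributive aggregate: sum of S(x)
    sum_s2 = 0.0  # distributive aggregate: sum of S(x)^2
    for x, fx in f.items():                         # outer loop over S
        nbrs = neighbors.get(x, [])
        if not nbrs:
            continue  # assumption: nodes without neighbors are skipped
        e_x = sum(f[y] for y in nbrs) / len(nbrs)   # neighborhood aggregate
        s_x = fx - e_x
        n += 1
        sum_s += s_x
        sum_s2 += s_x * s_x
    # Algebraic aggregates derived from the distributive ones.
    mu = sum_s / n
    sigma = sqrt(max(sum_s2 / n - mu * mu, 0.0))
    return mu, sigma


# Hypothetical four-station path a - b - c - d.
f = {"a": 10.0, "b": 12.0, "c": 11.0, "d": 50.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
mu, sigma = build_model(f, neighbors)
```

Only three running values are kept in memory regardless of the number of nodes, which is the property the single-scan argument relies on.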

Assume each node has k neighbors. The Get_One_Object() operation retrieves
data points page by page, which reduces redundant disk I/O operations.
The Find_Neighbor_Nodes_Set() operation retrieves the data records of the k neighbors
of the node currently being processed. If the neighbor nodes are not in the memory buffer,
extra I/O operations are required to retrieve the disk pages which contain the data records of
the neighbor nodes. The time needed to locate and transfer a disk block to the memory
buffer is on the order of milliseconds, usually ranging from 15 to 60 msec. Therefore,
the operation Find_Neighbor_Nodes_Set() dominates the computation time. The I/O
cost of Find_Neighbor_Nodes_Set() is determined by the clustering efficiency (CE), that
is, how the nodes are grouped into disk pages. If a node and all of its neighbor nodes can
be arranged in the same disk page, no redundant I/O operation will be required. The
execution time for each data object can be estimated as follows.

For instance, we assume a 500 MHz machine, and that the Cycles Per Instruction (CPI)
is 5, the instruction count is 150, the number of neighbors for each data node is k = 10, that
each disk I/O operation takes $15 * 10^{-3}$ sec (typical: 15 - 60 msec), and that CE denotes
the clustering efficiency. For each data record, the execution time will be CPU time +
(I/O time) * (1 - CE) * k = $150 * 5 * \frac{1}{500 * 10^{6}} + 15 * 10^{-3} * (1 - CE) * 10$
= $1.5 * 10^{-6} + 1.5 * 10^{-1} * (1 - CE)$ sec. The CE
value determines the execution time. The higher the CE value, the shorter the execution
time. We will formally define clustering efficiency in the following section.
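The per-record estimate can be reproduced with a short calculation. The sketch below simply encodes the figures assumed in the text (500 MHz clock, CPI of 5, 150 instructions, 10 neighbors, 15 msec per disk I/O); the function name `time_per_record` and its parameter names are illustrative.

```python
def time_per_record(ce, clock_hz=500e6, cpi=5, instructions=150,
                    k=10, io_sec=15e-3):
    """Estimated execution time (seconds) for one data record.

    CPU time = instructions * CPI / clock rate
    I/O time = (time per disk I/O) * (1 - CE) * k
    """
    cpu_time = instructions * cpi / clock_hz
    io_time = io_sec * (1.0 - ce) * k
    return cpu_time + io_time


# Compare a perfectly clustered layout with progressively worse ones.
estimates = {ce: time_per_record(ce) for ce in (1.0, 0.68, 0.31, 0.0)}
```

With CE = 1 only the CPU term of about 1.5 microseconds remains, while with CE = 0 the I/O term of about 0.15 sec dominates by roughly five orders of magnitude.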

Lemma 3 Algebraic aggregate functions needed by the difference function $F_{diff}$ and
statistic test function $ST$ can be computed by the Model Building Algorithm in one scan
of the spatial self-join of the data set.

Proof: By definition, an algebraic function F of a data set can be computed using a fixed
number of distributive aggregates from each partition of the data set. In the Model Building
algorithm, a fixed number k of distributive aggregate functions $D^{G_1}_{aggr}, D^{G_2}_{aggr}, \ldots, D^{G_k}_{aggr}$
are used to store the aggregate values in memory, and the algebraic aggregate functions
are then computed using these aggregate values. If the distributive aggregate functions
can fit inside the memory buffer, the algebraic aggregate functions can be computed using
a single disk scan of the self-join of the data set. Distributive aggregate functions needed
for various quantitative spatial outlier definitions are shown in Table 2. In all cases,
one needs a very small number (less than a dozen) of distributive aggregate functions to
compute the algebraic aggregate functions needed by each spatial outlier definition.

3.2.2  Test  Result  Computation 

The algebraic aggregate functions, e.g., mean and standard deviation, computed in the
Model Building algorithm can be used to verify the outlierness of incoming data sets.
The two verification algorithms are Route Outlier Detection (ROD) and Random Node
Verification (RNV). The ROD algorithm detects the spatial outliers along a user-specified
route, as shown in Algorithm 2. The RNV procedure checks the outlierness of a set
of randomly generated nodes. The steps to detect outliers in ROD and RNV are
similar, except that RNV has no shared data access needs across tests for different
nodes. The I/Os for Find_Neighbor_Nodes_Set() in different iterations are independent
of each other in RNV. We note that the operation Find_Neighbor_Nodes_Set() is executed
once in each iteration and dominates the I/O cost of the entire algorithm. The storage
of the data set should therefore support this operation with few I/Os. We discuss
the choice of storage structure and provide an experimental comparison in Sections 5 and
6.

Given a route RN within the data set S, the ROD algorithm first retrieves the
neighboring nodes from S for each data object, say x, in the route RN; then it computes
the neighborhood aggregate function $f^N_{aggr}$ using the attribute value of x and the
attribute values of x's neighbors. The difference function $F_{diff}$ is computed using the
attribute function $f(x)$, the neighborhood aggregate function $f^N_{aggr}$, and the algebraic
aggregate functions computed in the Model Building algorithm. Node x can then be tested
for outlierness using the statistical test function $ST$.

Route Outlier Detection (ROD) Algorithm

Input:  S is a spatial framework;
        f is an attribute function;
        N is the neighborhood relationship;
        f_aggr^N is a neighborhood aggregate function;
        F_diff is a difference function;
        A_aggr^G1, A_aggr^G2, ..., A_aggr^Gk are algebraic aggregate functions;
        ST is the spatial outlier test function;
        RN is the set of nodes in a route;
Output: Outlier_Set.

for(i=1; i <= |RN|; i++){
    Oi = Get_One_Object(i, RN);                /* Select each object from RN */
    NNS = Find_Neighbor_Nodes_Set(Oi, N, S);   /* Find neighbor nodes of Oi from S */
    for(j=1; j <= |NNS|; j++){
        Oj = Get_One_Object(j, NNS);           /* Select each neighbor of Oi */
        f_aggr^N = Compute_and_Aggregate(f(Oi), f(Oj));
    }
    F_diff = Compute_Difference(f, f_aggr^N, A_aggr^G1, ..., A_aggr^Gk);
    if (ST(F_diff, A_aggr^G1, ..., A_aggr^Gk) == True){
        Add_Element(Outlier_Set, i);           /* Add the element to Outlier_Set */
    }
}
return Outlier_Set.

Algorithm 2: Pseudo-code for route outlier detection
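A compact sketch of the route test follows. It is one possible instantiation, not the paper's code: it assumes the difference function is $S(x) = f(x) - E(x)$ with $E(x)$ the neighbor average, and that the test flags a node when the normalized difference $|(S(x) - \mu)/\sigma|$ exceeds a threshold `theta`. The model parameters `mu` and `sigma` are taken as given (they would come from model building), and all names and values are hypothetical.

```python
def route_outliers(route, f, neighbors, mu, sigma, theta=2.0):
    """Return the nodes on `route` whose normalized difference exceeds theta.

    route:     list of nodes to verify, in route order
    f:         dict mapping node -> attribute value
    neighbors: dict mapping node -> list of neighbor nodes
    mu, sigma: mean and standard deviation of S(x) from model building
    """
    outliers = []
    for x in route:
        nbrs = neighbors.get(x, [])
        if not nbrs or sigma == 0:
            continue  # assumption: untestable nodes are never flagged
        e_x = sum(f[y] for y in nbrs) / len(nbrs)   # neighborhood aggregate
        z = abs((f[x] - e_x - mu) / sigma)          # difference + statistic test
        if z > theta:
            outliers.append(x)
    return outliers


# Hypothetical route of five stations in a chain; station "c" reads far too high.
f = {"a": 10.0, "b": 11.0, "c": 40.0, "d": 10.0, "e": 11.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
flagged = route_outliers(["a", "b", "c", "d", "e"], f, neighbors,
                         mu=0.0, sigma=5.0, theta=4.0)
```

Note that the neighbors of an anomalous station also receive elevated scores, because the anomalous value enters their neighborhood averages; the threshold decides whether they are reported.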


4  Analytical  Evaluation  and  Cost  Models 

The computation of outlier detection algorithms is dominated by the operation
Find_Neighbor_Nodes_Set(), whose cost is determined by the clustering efficiency (CE)
of the disk page clustering. In this section, we provide simple algebraic cost
models for the I/O cost of outlier detection operations, using the CE measure of physical
page clustering methods. The CE value is defined as follows:

$CE = \frac{\text{Total number of unsplit edges}}{\text{Total number of edges}}$

The CE value is determined by the disk page clustering method, the data record size,
and the disk page size. Figure 7 gives an example of CE value calculation. The blocking
factor, i.e., the number of data records within a page, is three, and there are nine data
records. The data records are clustered into three pages. There are a total of nine edges
and six unsplit edges. The CE value of this graph can be calculated as 6/9 = 0.67.
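The CE definition translates directly into code. In the sketch below an edge counts as unsplit when both of its endpoints are stored on the same page; the function name `clustering_efficiency` and the small graph are illustrative (the graph is a made-up layout with nine records, three pages, and nine edges, not a transcription of Figure 7).

```python
def clustering_efficiency(edges, page_of):
    """CE = (number of edges with both endpoints on one page) / (number of edges).

    edges:   list of (u, v) node pairs
    page_of: dict mapping node -> page identifier
    """
    unsplit = sum(1 for u, v in edges if page_of[u] == page_of[v])
    return unsplit / len(edges)


# Hypothetical layout: nodes 1-3 on page 0, 4-6 on page 1, 7-9 on page 2.
page_of = {node: (node - 1) // 3 for node in range(1, 10)}
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9),  # unsplit edges
         (3, 4), (6, 7), (1, 9)]                            # split edges
ce = clustering_efficiency(edges, page_of)
```

Changing only the `page_of` mapping while keeping `edges` fixed shows how different page clustering methods yield different CE values for the same graph.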


Figure  7:  Example  of  Clustering  Efficiency  (CE) 


Symbol    Meaning
$\alpha$  The CE value
$\beta$   Average blocking factor
N         Total number of nodes
L         Number of nodes in a route
R         Number of nodes in a random set
A         Average number of neighbors for each node

Table 3: Symbols used in the Cost Analysis

Table 3 lists the symbols used to develop our cost formulas. $\alpha$ is the CE value. $\beta$
denotes the blocking factor, which is the number of data records that can be stored in
one memory page. A is the average number of nodes in the neighbor list of a node. N is
the total number of nodes in the data set, L is the number of nodes along a route, and
R is the number of nodes randomly generated by users for spatial outlier verification.

Lemma 4 The cost function for the Model Building algorithm is
$C_{MB} = \frac{N}{\beta} + N * A * (1 - \alpha)$

Proof: The Model Building algorithm is a nested-loop index join. Suppose that we use
two memory buffers: one memory buffer stores the data object x used in the outer loop
and the other memory buffer is reserved for processing the neighbors of x. The outer
loop retrieves all the data records on a page basis and has an aggregated cost of
$\frac{N}{\beta}$. For each node x, on average, $\alpha * A$ neighbors are in the same page as x, and can be
processed without redundant I/O. Additional data page accesses are needed to retrieve
the other $(1 - \alpha) * A$ neighbors, and it takes at most $(1 - \alpha) * A$ data page accesses.
Thus the expected total cost for the inner loop is $N * A * (1 - \alpha)$. ■


Lemma 5 The cost function for the ROD algorithm is
$C_{ROD} = L * (1 - \alpha) + L * A * (1 - \alpha) = L * (1 - \alpha) * (1 + A)$

Proof: Assume two memory buffers are used for the ROD algorithm; one memory buffer
is reserved for processing the node x to be verified, and the other is used to process the
neighbors of x. For each node x, on average, its successor node y is in the same page
as x with probability $\alpha$, and can be processed with no redundant page accesses. The cost
to access all the nodes along a route is $L * (1 - \alpha)$. To process the neighbors of each node,
$\alpha * A$ neighbors are in the same page as x. Additional data page accesses are needed to
retrieve the other $(1 - \alpha) * A$ neighbors, and it takes at most $(1 - \alpha) * A$ data page accesses. ■


Lemma 6 The cost function for the RNV algorithm is $C_{RNV} = R + R * A * (1 - \alpha)$

Proof: Suppose two memory buffers are used for the RNV algorithm; one memory buffer
is reserved for processing the node x to be verified, and the other is used to process the
neighbors of x. Since the memory buffer is assumed to be cleared for each consecutive
random node, we need R page accesses to process all these random nodes. For each node
x, $\alpha * A$ neighbors are in the same page as x, and can be processed without extra I/O.
Additional data page accesses are needed to retrieve the other $(1 - \alpha) * A$ neighbors, and
it takes at most $(1 - \alpha) * A$ data page accesses. Thus, the expected total cost to process
the neighbors of R nodes is $R * A * (1 - \alpha)$.
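The three cost functions from Lemmas 4 to 6 can be evaluated numerically with a small sketch. The function and parameter names are illustrative; the sample inputs (N = 773, L = 38, R = 150, blocking factor 4, A = 2, CE = 0.68) are the values reported for CCAM in Table 4.

```python
def cost_mb(n, beta, a, alpha):
    """Lemma 4: page accesses for model building, N/beta + N*A*(1 - alpha)."""
    return n / beta + n * a * (1 - alpha)


def cost_rod(route_len, a, alpha):
    """Lemma 5: page accesses for route outlier detection, L*(1 - alpha)*(1 + A)."""
    return route_len * (1 - alpha) * (1 + a)


def cost_rnv(r, a, alpha):
    """Lemma 6: page accesses for random node verification, R + R*A*(1 - alpha)."""
    return r + r * a * (1 - alpha)


# Parameters reported for CCAM in Table 4.
alpha = 0.68
predicted = {
    "MB": cost_mb(773, 4, 2, alpha),
    "ROD": cost_rod(38, 2, alpha),
    "RNV": cost_rnv(150, 2, alpha),
}
```

For these inputs the functions give roughly 688, 36.5, and 246 page accesses, which is in line with the predicted values of 687, 36, and 246 listed for CCAM in Table 4.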

5  Experiment  Design 

In the spatial outlier detection algorithm, clustering efficiency (CE) is a dominant factor
for computation cost. The CE value is determined by the disk page clustering method.
In this section, we describe the layout of our experiments and illustrate the candidate
data clustering methods.

5.1  Experimental  Layout 

The design of our experiments is shown in Figure 8. The input data set, a Twin Cities
Highway Connectivity Graph, was provided by the Minnesota Department of Transportation
and physically stored into data pages using different page clustering strategies and
page sizes. These data pages were then processed to generate sets of pages of data to
be used by the Model Building algorithm and Test Result Computation algorithm. The
Model Building algorithm computes the algebraic aggregate functions to be used by the
Test Result Computation algorithm to detect spatial outliers.

We compared three different data page clustering schemes: the Connectivity-Clustered
Access Method (CCAM) [25], Z-ordering [19], and Cell-tree [7]. Other parameters
of interest were the size of the memory buffer, the buffering strategies, the
memory block size (page size), and the number of neighbors. The experimental measures
for the Model Building procedure and the Test Result Computation procedures are the
CE value and I/O cost.


Figure  8:  Experimental  Layout 


The  experiments  were  conducted  on  many  spatial  frameworks.  We  present  the  results 
on  a  representative  framework,  which  is  a  spatial  network  with  990  nodes  that  represents 
the  traffic  detector  stations  for  a  20-square-mile  section  of  the  Twin  Cities  area.  We  used 
a  common  record  type  for  all  the  clustering  methods.  Each  record  contains  a  node  and 
its  neighbor-list,  i.e.,  a  successor-list  and  a  predecessor-list.  The  size  of  each  record  is 
256  bytes. 

5.2  Candidate  Clustering  Methods 

In  this  section,  we  describe  the  candidate  clustering  methods  used  in  the  experiments. 

Connectivity-Clustered Access Method (CCAM): CCAM [25] clusters the
nodes of the graph via graph partitioning, e.g., Metis [11, 12]. Other graph-partitioning
methods can also be used as the basis of our scheme. In addition, an auxiliary secondary
index is used to support query operations. The choice of a secondary index can be
tailored to the application. Since the benchmark graph was embedded in geographical space,
we used the B+-tree with Z-order in our experiments. Other access methods such as the
R-tree and Grid File can alternatively be created on top of the data file as secondary
indices in CCAM to suit the application. In Figure 9, a simple graph and its CCAM are
shown. The left half of Figure 9 shows a spatial graph. Nodes are annotated with the
node-id and geographical coordinates. The node-id is an integer representing the Z-order
of the (x, y) coordinates. For example, the node with the coordinates (1,1) gets a node-id
of 3. The solid lines that connect nodes represent edges. The dashed lines show the
cuts and partitioning of the spatial graph into data pages. The partitions are (0,1,4,5),
(3,6,8,9), (7,12,13,18), and (11,14,15,26). The right half of Figure 9 shows the data pages
and the secondary index. Nodes in the same partition set are stored on the same data
page.

Linear Clustering by Z-order: Z-order [19] utilizes spatial information while
imposing a total order on the points. The Z-order of a coordinate (x, y) is computed by
interleaving the bits in the binary representation of the two values. Alternatively, Hilbert
ordering may be used. A conventional one-dimensional primary index (e.g., B+-tree) can
be used to facilitate a search. Figure 10 shows an example of using Z-order as the page
clustering method and B-tree as the primary index for accessing the data file. The nodes
of different partitions are (0,1,3,4), (5,6,7,8), (9,11,12,13), and (14,15,18,26).
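Bit interleaving can be sketched as follows. The text does not state which coordinate supplies the lower bit of each pair, so the choice made here (x in the even bit positions, y in the odd ones) is an assumption; the function name `z_order` and the 16-bit default width are likewise illustrative.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Z-order key.

    Assumption: x occupies the even bit positions and y the odd ones.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bit i of x -> position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y -> position 2i + 1
    return z


# Keys for a 2 x 2 block of coordinates, sorted into Z-order.
block = sorted((z_order(x, y), (x, y)) for x in range(2) for y in range(2))
```

Under either interleaving convention the coordinates (1,1) map to node-id 3, which agrees with the example given for Figure 9. Sorting records by this key and cutting the sorted list into pages gives the linear clustering described above.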

Cell Tree: A Cell tree [7] is a height-balanced tree. Each cell tree node corresponds
not necessarily to a rectangular box but to a convex polyhedron. A cell tree restricts
polyhedra to partitions of a BSP (Binary Space Partitioning) in order to avoid overlaps
among sibling polyhedra. Each cell-tree node corresponds to one disk page, and the leaf
nodes contain all the information required to answer a given search query. The cell-tree
can be viewed as a combination of a BSP-tree and an R+-tree, or as a BSP-tree mapped on
paged secondary memory. Figure 11 is an example of using Cell Tree to cluster the nodes.
The first level binary partition is line H1, and the second level partitions are lines H2
and H3. The partitions are (0,1,3,6), (4,5,7,18), (8,9,11,14), and (12,13,15,26).


Figure  9:  CCAM  Clustering  Method 

5.3  Candidate  Buffering  Strategies

We evaluated three buffering strategies to replace the page in the memory buffer. The
simplest page replacement algorithm is the First In First Out (FIFO) algorithm. A FIFO
replacement algorithm marks the time when each page was brought into the memory
buffer. When a page must be replaced, the oldest page is chosen. The Least Recently
Used (LRU) algorithm selects the page that has not been referenced for the longest period
of time for replacement. In contrast, the Most Recently Used (MRU) algorithm replaces
the page which has been most recently referenced.
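The three replacement policies can be compared with a small simulation that counts page faults for a reference string. This is a sketch for illustration only; the function name `count_page_faults`, the policy labels, and the reference string are assumptions, and the buffer is modeled with Python's `collections.OrderedDict`.

```python
from collections import OrderedDict


def count_page_faults(references, capacity, policy):
    """Count buffer misses for a page reference string.

    policy: "FIFO", "LRU", or "MRU".
    The OrderedDict keeps pages ordered by load time (FIFO) or by most
    recent reference (LRU, MRU), oldest first.
    """
    if policy not in ("FIFO", "LRU", "MRU"):
        raise ValueError(f"unknown policy: {policy}")
    buffer = OrderedDict()
    faults = 0
    for page in references:
        if page in buffer:
            if policy != "FIFO":
                buffer.move_to_end(page)   # record the reference as most recent
            continue
        faults += 1
        if len(buffer) >= capacity:
            # FIFO and LRU evict the front entry; MRU evicts the back entry.
            buffer.popitem(last=(policy == "MRU"))
        buffer[page] = None
    return faults


# Hypothetical reference string with three buffer pages.
refs = [1, 2, 3, 1, 4, 1, 2]
faults = {p: count_page_faults(refs, 3, p) for p in ("FIFO", "LRU", "MRU")}
```

Feeding the simulation the page reference sequence produced by a particular clustering method would allow the strategies to be compared in the same way as the experiments in Section 6.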


6  Experimental  Observations  and  Results 

In  this  section,  we  illustrate  outlier  examples  detected  in  the  traffic  data  set,  present  the 
results  of  our  experiments,  and  test  the  effectiveness  of  different  page  clustering  methods. 
To  simplify  the  comparison,  the  I/O  cost  represents  the  number  of  data  pages  accessed. 
This  represents  the  relative  performance  of  the  various  methods  for  very  large  databases. 
For  smaller  databases,  the  I/O  cost  associated  with  the  indices  should  be  measured. 
We  examined  the  CE  measures  in  the  set  of  experiments  that  deals  with  range  outlier 
detection  queries. 

6.1  Outliers  Detected 

We tested the effectiveness of our algorithm on the Twin-Cities traffic data set and detected
numerous outliers, as described in the following examples.

Figure 12 shows one example of traffic flow outliers. Figures 12(a) and (b) are the
traffic volume maps for I-35W north bound and south bound, respectively, on 1/21/1997.
The X-axis is a 5-minute time slot for the whole day and the Y-axis is the label of the
stations installed on the highway, starting from 1 on the north end to 61 on the south end.
The abnormal white line at 2:45PM and the white rectangle from 8:20AM to 10:00AM
on the X-axis and between stations 29 and 34 on the Y-axis can be easily observed from
both (a) and (b). The white line at 2:45PM is an instance of temporal outliers, whereas the
white rectangle is a spatial-temporal outlier. Moreover, station 9 in Figure 12(a) exhibits
inconsistent traffic flow compared with its neighboring stations, and was detected as a
spatial outlier.

Figure 10: Z-order Clustering Method

6.2  Evaluation  of  the  Proposed  Cost  Model 

We evaluated the I/O cost for different clustering methods for outlier detection
procedures, namely, Model Building (MB), Route Outlier Detection (ROD), and Random
Node Verification (RNV). The experiments used Twin-Cities traffic data with a page size of
1 Kbyte and two memory buffers. Table 4 shows the number of data page accesses
for each procedure under various clustering methods. The CE value for each method is
also listed in the table. The cost function for MB is $C_{MB} = \frac{N}{\beta} + N * A * (1 - \alpha)$. The
cost function for RNV is $C_{RNV} = R + R * A * (1 - \alpha)$. The cost function for ROD is
$C_{ROD} = L * (1 - \alpha) * (1 + A)$, as described in Section 4.


Clustering   Parameters Computation   Random Node Verification   Route Outlier Detect    $\alpha$ = CE
Method       Actual     Predicted     Actual     Predicted       Actual     Predicted
CCAM         628        687           241        246             30         36           0.68
Cell-tree    834        919           279        291             45         53           0.53
Z-order      1263       1269          349        357             78         79           0.31

N = 773, L = 38, R = 150, $\beta$ = 4, A = 2

Table 4: The Actual I/O Cost and Predicted Cost Model for Different Clustering Methods

As shown in Table 4, CCAM produced the lowest number of data page accesses for
the outlier detection procedures. This is to be expected, since CCAM generated the
highest CE value.

Figure 11: Cell-tree Clustering Method

Figure 12: An Example of an Outlier. Panels: (a) I-35W North Bound, (b) I-35W South Bound

6.3  Evaluation  of  I/O  Cost  for  the  Model  Building  Algorithm 

In  this  section,  we  present  the  results  of  our  evaluation  of  the  I/O  cost  and  CE  value  for 
alternative  clustering  methods  while  computing  the  Model.  The  parameters  of  interest 
are  buffer  size,  page  size,  number  of  neighbors,  and  neighborhood  depth. 

The Effect of Buffering: We evaluated the effect of buffering on the performance of
the page clustering methods and buffer replacement strategies. The variable parameter
was the number of buffers available. Figure 13(a) shows the effect of buffering on the
performance of model construction for various clustering methods with a fixed page size of 2
Kbytes. As can be seen, the performance improves as the number of buffers increases.
The performance ranking of the clustering methods remains the same for different
buffer sizes. Figure 13(b) demonstrates the effect of different buffering strategies on the
number of page accesses. When the buffer size is small (e.g., 4-8), the LRU algorithm
has the best performance. As the number of buffers increases to greater than 10, both
FIFO and LRU have better performance than MRU.

The Effect of Page Size and CE Value: Figures 14 (a) and (b) show the
number of data pages accessed and the CE values, respectively, for different page clustering
methods, as the page sizes change. The buffer size is fixed at 32 Kbytes. As can be seen,
a higher CE value implies a lower number of data page accesses, as predicted in the cost
model. CCAM outperforms the other competitors for all four page sizes, and Cell-tree
has better performance than Z-order clustering.

Figure 13: Effect of Buffering

The Effect of Neighborhood Cardinality: We evaluated the effect of varying
the number of neighbors and the depth of neighbors for different page clustering methods.
The neighborhood depth defines the levels of the neighborhood relationship. When the
neighborhood depth D is set to one, only directly connected nodes are considered as
neighbors; when D is set to be greater than one, node $n_2$ is considered a neighbor
of node $n_1$ provided there is a path connecting $n_1$ to $n_2$ with number of edges less than
or equal to D. For example, suppose there are three nodes, $n_a$, $n_b$, and $n_c$, with the directed
graph relationship edge($n_a$, $n_b$) and edge($n_b$, $n_c$). If the neighborhood depth is set to two, node
$n_c$ will be considered a neighbor of node $n_a$ due to a path of length two via node $n_b$.
We fixed the page size at 1 Kbyte, the buffer size at 4 Kbytes, and used the LRU buffering
strategy. Figure 15(a) shows the number of page accesses as the number of neighbors for
each node increases from 2 to 10. CCAM has better performance than Z-order and
Cell-tree. The performance ranking of the page clustering methods remains the same
for different numbers of neighbors. Figure 15(b) shows the number of page accesses as the
neighborhood depth increases from 1 to 5. CCAM has better performance than Z-order
and Cell-tree for all the neighborhood depths.
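The depth-based neighborhood definition corresponds to a breadth-first search that stops after D edges. The sketch below assumes the graph is given as a dictionary of directed successor lists; the function name `neighbors_within_depth` and the three-node graph (mirroring the example in the text) are illustrative.

```python
from collections import deque


def neighbors_within_depth(graph, start, depth):
    """Nodes reachable from `start` by a directed path of at most `depth` edges.

    graph: dict mapping node -> list of successor nodes
    The start node itself is not reported as its own neighbor.
    """
    seen = {start}
    result = []
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for successor in graph.get(node, []):
            if successor not in seen:
                seen.add(successor)
                result.append(successor)
                frontier.append((successor, dist + 1))
    return result


# The example from the text: edge(na, nb) and edge(nb, nc).
graph = {"na": ["nb"], "nb": ["nc"], "nc": []}
depth_one = neighbors_within_depth(graph, "na", 1)
depth_two = neighbors_within_depth(graph, "na", 2)
```

With depth one only `nb` is a neighbor of `na`; with depth two `nc` is included as well, so the neighbor list, and with it the number of page accesses per node, grows with D.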

6.4  Evaluation  of  I/O  cost  for  ROD  algorithm 

We evaluated the performance for different page clustering methods, page sizes, and buffer
sizes when users request an outlier detection along a given route (e.g., I-35W north bound)
on a highway. We also evaluated the performance of the RNV algorithm. The experimental
results showed a trend similar to that of the ROD algorithm.

The effect of buffering: We evaluated the effect of buffering for the outlier
detection along a route. Figure 16 shows the number of page accesses as we increase the
buffer number from 2 to 8. As can be seen, increasing the buffer size does not improve
the performance beyond a certain buffer size, and CCAM has the best performance.


Figure 14: Effect of Page Size on Data Page Accesses and Clustering Efficiency (Buffer
Size = 32 Kbytes). Panels: (a) Page Accesses and (b) Clustering Efficiency (CE), each
plotted against Page Size (Kbytes)


The effect of page size and CE value: Figures 17 (a) and (b) show the number of
data pages accessed and the CE values, respectively, for different page clustering methods
as the page size changes. The buffer size is fixed at 4 Kbytes. As can be seen, a higher
CE value implies a lower number of data page accesses, as predicted by the cost model.
CCAM outperforms the other competitors for all three page sizes. Note that Cell-tree
has a CE value of 0 and generates the highest number of page accesses when the page
size is 0.5 Kbytes and the record size is 256 bytes. Cell-tree clusters stations by Euclidean
distance even when there is no edge connecting the stations. This can lead to low CE
values, and to a CE value of 0 when each data page (disk block) can hold only two records.
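
The Cell-tree case can be illustrated with a small sketch. It assumes that CE is measured as the fraction of edges whose two endpoints are stored on the same data page. The network, the page assignments, and the function name `clustering_efficiency` are hypothetical.

```python
def clustering_efficiency(edges, page_of):
    """Fraction of edges whose two endpoints are stored on the same page."""
    if not edges:
        return 0.0
    unsplit = sum(1 for u, v in edges if page_of[u] == page_of[v])
    return unsplit / len(edges)


# Hypothetical network: two parallel roads, a-b-c and d-e-f. Stations a/d,
# b/e, and c/f are close in Euclidean distance but are not connected.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]

# Each page holds only two records.
by_distance = {"a": 0, "d": 0, "b": 1, "e": 1, "c": 2, "f": 2}
by_connectivity = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}

ce_distance = clustering_efficiency(edges, by_distance)
ce_connectivity = clustering_efficiency(edges, by_connectivity)
print(ce_distance, ce_connectivity)
```

Pairing stations by distance splits every edge across pages (CE of 0), whereas pairing connected stations keeps half of the edges unsplit in this example.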


7  Conclusions  and  Future  Work 

In this paper, we focus on detecting spatial outliers in spatial data sets. We propose
a definition of S-outliers which generalizes traditional spatial outliers; we also analyze
computation structures for detecting spatial outliers, design efficient algorithms to detect
outliers, provide cost models for outlier detection procedures, and compare the
performance of our approach using different data clustering approaches. In addition, we
provide experimental results from the application of our algorithm on a Twin Cities traffic
archive to show its effectiveness and usefulness.

We have evaluated alternative clustering methods for neighbor outlier query processing,
including model building, random node verification, and route outlier detection.
Our experimental results show that the connectivity-clustered access method (CCAM),
which achieves the highest clustering efficiency (CE) value, provides the best overall
performance.
Figure 15: Effect of Neighborhood Cardinality on Data Page Accesses (Page Size = 1
Kbytes, Buffer Size = 4 Kbytes). Panels: (a) Number of Neighbors; (b) Neighborhood Depth.


Our algorithm is designed to detect spatial outliers using a single non-spatial
attribute from a data set. We are planning to investigate spatial outliers with multiple
non-spatial attributes, such as the combination of volume, occupancy, and speed in the
traffic data set. For multiple attributes, the definition of spatial neighborhood will be the
same, but the neighborhood aggregate function, comparison function, and statistical test
function need to be redefined. The key challenge is to define a general distance function
in a multi-attribute data space.

We  will  also  explore  graphical  methods  for  spatial  outlier  detection.  The  key  issue  is 
to  facilitate  the  visualization  of  spatial  relationships  while  highlighting  spatial  outliers. 
For  instance,  in  variogram  cloud  and  scatterplot  visualizations,  the  spatial  relationship 
between  a  single  spatial  outlier  and  its  neighbors  is  not  obvious.  It  is  necessary  to  transfer 
the  information  back  to  the  original  map  to  check  neighbor  relationships.  As  a  single 
spatial  outlier  tends  to  flag  not  only  the  spatial  location  of  local  instability  but  also  its 
neighboring  locations,  it  is  important  to  group  flagged  locations  and  identify  real  spatial 
outliers  from  the  group  in  the  post-processing  step. 

Although spatial outlier detection is the focus of this paper, Figure 12 shows other
types of outliers, such as temporal outliers and spatial-temporal outliers. While our
proposed algorithm can efficiently detect spatial outliers, temporal and spatial-temporal
outliers are detected by post-processing and data visualization. We are planning to
investigate the definitions of temporal and spatial-temporal outliers, as well as to extend
our algorithm to directly detect these outliers.
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Figure  16:  Effect  of  Buffering 
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A  Characterizing  the  Distribution  of  the  Statistic 

Theorem 2 Spatial statistic $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$ is normally distributed if
attribute value $f(x)$ is normally distributed.

Proof: 

For a spatially referenced object $x_i$, let $X_i$ be a random variable from the normally
distributed attribute function $f(x) \sim N(\mu, \sigma^2)$, where $\mu$ is the mean and
$\sigma$ is the standard deviation.

$x_{i1}, x_{i2}, \ldots, x_{ik}$ are the $k$ neighbors of $x_i$. The attribute variables
$X_{i1}, X_{i2}, \ldots, X_{ik}$ of objects $x_{i1}, x_{i2}, \ldots, x_{ik}$ are normally
distributed as $N(\mu_{ij}, \sigma_{ij}^2)$, $1 \le j \le k$, respectively. Now
let us consider two conditions of neighborhoods as follows:

(1) Assume i.i.d. in the local neighborhood window
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(2) Consider  the  spatial  correlation  in  the  local  neighborhood  window 
Based on the definition of neighborhood, for each spatially referenced object $x$, the
average attribute value $E_{y \in N(x)}(f(y))$ of $x$'s $k$ neighbors can be derived from
$f(x)$. Since the attribute function $f(x)$ is normally distributed and an average of normal
variables is also normally distributed [4], the average attribute value
$E_{y \in N(x)}(f(y))$ over neighbors also follows a normal distribution for a fixed
cardinality neighborhood.

We apply a normal linear transformation to derive $S(x)$, and the transformation
guarantees the normal distribution [4]. Let $\bar{X}$ be a random variable of
$E_{y \in N(x)}(f(y))$.
Since the attribute value and the average attribute value over neighbors are two
normal variables, the difference $S(x)$ between the attribute value of each data object
$x$ and the average attribute value of $x$'s neighbors is also normally distributed. ■
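
For condition (1), the distribution can be written out explicitly. The following is a sketch under the added assumptions that $X_i$ and its $k$ neighbor variables $X_{i1}, \ldots, X_{ik}$ are mutually independent and share the same mean $\mu$ and variance $\sigma^2$. The symbol $\bar{X}_i$ for the neighborhood average is introduced here for illustration.

```latex
\begin{align*}
\bar{X}_i &= \frac{1}{k} \sum_{j=1}^{k} X_{ij}
  \sim N\!\left(\mu,\ \frac{\sigma^2}{k}\right), \\
S(x_i) &= X_i - \bar{X}_i
  \sim N\!\left(0,\ \sigma^2 + \frac{\sigma^2}{k}\right)
  = N\!\left(0,\ \frac{(k+1)\,\sigma^2}{k}\right).
\end{align*}
```

The mean of the difference vanishes because both terms have mean $\mu$, and the variances add because the two terms are independent under these assumptions.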
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B  Aggregate  Function 


B.1  Distributive  Aggregate  Function


An aggregate function $F$ is called distributive if there exists a function $G$ such that the
value of $F$ for a data set can be computed by applying $G$ to the values of $F$ in
each partition of the whole data set. In most cases, $F = G$. For example, Figure 18 shows
the computation of distributive aggregate functions for a two-dimensional matrix $M$:
$F(M_{ij}) = G(F(C_j)) = G(F(R_i))$, where $M_{ij}$ represents the elements of the
two-dimensional matrix, $C_j$ denotes each column of the matrix, and $R_i$ denotes each
row of the matrix. Consider the aggregate functions Min and Count. In the first example,
$F = \mathrm{Min}$ and $G = \mathrm{Min}$, since
$\mathrm{Min}(M_{ij}) = \mathrm{Min}(\mathrm{Min}(C_j)) = \mathrm{Min}(\mathrm{Min}(R_i))$.
In the second example, $F = \mathrm{Count}$ and $G = \mathrm{Sum}$, since
$\mathrm{Count}(M_{ij}) = \mathrm{Sum}(\mathrm{Count}(C_j)) = \mathrm{Sum}(\mathrm{Count}(R_i))$.
Other distributive aggregate functions include Max and Sum. Note that “null” valued
elements are ignored in computing aggregate functions.
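
The two examples can be checked on a small matrix. This is a minimal sketch with hypothetical data, in which `None` stands for a "null" element. The helper `non_null` is illustrative.

```python
# Hypothetical 3x3 matrix; None marks a "null" element, which is ignored.
M = [
    [4, 7, None],
    [2, None, 9],
    [5, 1, 3],
]


def non_null(values):
    """Drop null elements before aggregating."""
    return [v for v in values if v is not None]


rows = [non_null(r) for r in M]        # partition by rows R_i
cols = [non_null(c) for c in zip(*M)]  # partition by columns C_j
everything = non_null(v for r in M for v in r)

# F = Min, G = Min: the minimum of the per-partition minima.
min_via_rows = min(min(r) for r in rows)
min_via_cols = min(min(c) for c in cols)

# F = Count, G = Sum: the sum of the per-partition counts.
count_via_rows = sum(len(r) for r in rows)
count_via_cols = sum(len(c) for c in cols)

print(min_via_rows, min_via_cols, min(everything))
print(count_via_rows, count_via_cols, len(everything))
```

Both partitionings give the same result as aggregating over all elements directly. A partition consisting only of nulls would need a special case for Min, which this sketch does not handle.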


Figure 18: Computation of distributive aggregate functions. Panels: Distributive
Aggregate Function: Min; Distributive Aggregate Function: Count.


B.2  Algebraic  Aggregate  Function 

An aggregate function $F$ is algebraic if $F$ of a data set can be computed using a fixed
number of sub-aggregates from each partition of the data set. Average, variance, standard
deviation, maxN, and minN are all algebraic aggregate functions. In Figure 19, for example,
the computations of average and variance for a two-dimensional matrix $M$ are shown.
The average of the elements in the data set $M$ can be computed from the sum and count
values of the one-dimensional sub-matrices (e.g., rows or columns). The variance can be
derived from the count, sum (i.e., $\sum_i X_i$), and sum of squares (i.e., $\sum_i X_i^2$)
of the rows or columns. Similar techniques apply to other algebraic functions.
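
As an illustration, the average and the variance can be assembled from three sub-aggregates per partition. This is a sketch with hypothetical data. The names `sub_aggregate` and `combine` are illustrative, and the variance computed is the population variance.

```python
def sub_aggregate(values):
    """Fixed-size summary of one partition: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(summaries):
    """Derive the mean and population variance from per-partition summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # E[X^2] - (E[X])^2
    return mean, variance


# Hypothetical matrix, partitioned by rows.
rows = [[2, 4], [4, 4], [5, 5], [7, 9]]
mean, variance = combine([sub_aggregate(r) for r in rows])
print(mean, variance)
```

Only three numbers per row are carried forward regardless of the row length. The $E[X^2] - (E[X])^2$ form is used here for brevity; it can lose precision when the variance is small relative to the mean.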

B.3  Holistic  Aggregate  Function 

Figure 19: Computation of algebraic aggregate functions. Panels: Algebraic Aggregate
Function: Average; Algebraic Aggregate Function: Variance.

An aggregate function $F$ is called holistic if the value of $F$ for a data set cannot be
computed using a constant number of sub-aggregates from each partition of the data set.
To compute the value of $F$, we need to access the whole data set. Examples of holistic
functions include median, mostFrequent, and rank.
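
A small example with the median shows why a single sub-aggregate per partition is not enough. The data are hypothetical and the sketch uses only the Python standard library.

```python
from statistics import median

# Hypothetical data set split into three partitions.
partitions = [[1, 2, 9], [3, 4, 10], [5, 6, 11]]

# Combining one sub-aggregate per partition (its median) ...
per_partition = [median(p) for p in partitions]
median_of_medians = median(per_partition)

# ... differs from the median over the whole data set.
all_values = [v for p in partitions for v in p]
true_median = median(all_values)

print(per_partition, median_of_medians, true_median)
```

Here the median of the partition medians differs from the median of the full data set, so the per-partition medians alone do not determine the result.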
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